Abstract

This paper presents an improvement to 
model learning when using multi-class Log- 
itBoost for classification. Motivated by the 
statistical view, LogitBoost can be seen as 
additive tree regression. Two important fac- 
tors in this setting are: 1) coupled classi- 
fier output due to a sum-to-zero constraint, 
and 2) the dense Hessian matrices that arise 
when computing tree node split gain and 
node value fittings. In general, this setting is 
too complicated for a tractable model learn- 
ing algorithm. However, too aggressive sim- 
plification of the setting may lead to degraded 
performance. For example, the original Log- 
itBoost is outperformed by ABC-LogitBoost 
due to the latter's more careful treatment of 
the above two factors. 

In this paper we propose techniques to ad- 
dress the two main difficulties of the Log- 
itBoost setting: 1) we adopt a vector tree 
(i.e., each node value is a vector) that enforces
a sum-to-zero constraint, and 2) we use an 
adaptive block coordinate descent that ex- 
ploits the dense Hessian when computing tree 
split gain and node values. Higher classifi- 
cation accuracy and faster convergence rates 
are observed for a range of public data sets 
when compared to both the original and the 
ABC-LogitBoost implementations.

1. Introduction

Boosting is successful for both binary and multi-class
classification (Freund & Schapire, 1995; Schapire &
Singer, 1999). Among the popular variants, we focus on
LogitBoost (Friedman et al., 1998) in this paper.
LogitBoost was originally motivated by a statistical view
(Friedman et al., 1998), in which a boosting algorithm
consists of three key components: the loss, the function
model, and the optimization algorithm. In the case of
LogitBoost, these are the Logit loss, the use of additive
tree models, and a stage-wise optimization, respectively.
There are two important factors in the LogitBoost setting.
Firstly, the posterior class probability estimate must be
normalised so as to sum to one in order to use the Logit
loss. This leads to a coupled classifier output, i.e., the
sum-to-zero classifier output. Secondly, a dense Hessian
matrix arises when deriving the tree node split gain and
node value fitting. It is challenging to design a tractable
optimization algorithm that fully handles both these
factors. Consequently, some simplification and/or
approximation is needed. Friedman et al. (1998) propose a
"one scalar regression tree for one class" strategy. This
breaks the coupling in the classifier output so that at each
boosting iteration the model updating collapses to K
independent regression tree fittings, where K denotes the
number of classes. In this way, the sum-to-zero constraint
is dropped and the Hessian is approximated diagonally.

Unfortunately, Friedman's prescription turns out to have
some drawbacks. A later improvement, ABC-LogitBoost, is
shown to outperform LogitBoost in terms of both
classification accuracy and convergence rate (Li, 2008;
2010a). This is due to ABC-LogitBoost's careful handling of
the above key problems of the LogitBoost setting. At each
iteration, the sum-to-zero constraint is enforced so that
only K - 1 scalar trees are fitted for K - 1 classes. The
remaining class, called the base class, is selected
adaptively per iteration (or every several iterations),
hence the acronym ABC (Adaptive Base Class). Also, the
Hessian matrix is approximated in a more refined manner
than in the original LogitBoost when computing the tree
split gain and fitting the node value.

Figure 1. A newly added tree at some boosting iteration for
a 3-class problem. (a) A class pair (shown in brackets) is
selected for each tree node. For each internal node
(filled), the pair is for computing split gain; for terminal
nodes (unfilled), it is for node vector updating. (b) The
feature space (the outer black box) is partitioned by the
tree in (a) into regions $\{R_1, R_2, R_3\}$. On each region
only two coordinates are updated based on the corresponding
class pair shown in (a).

In this paper, we propose two novel techniques to address
the challenging aspects of the LogitBoost setting. In our
approach, one vector tree is added per iteration. We allow
a K dimensional sum-to-zero vector to be fitted for each
tree node. This permits us to explicitly formulate the
computation for both node split gain and node value fitting
as a K dimensional constrained quadratic optimization,
which arises as a subproblem in the inner loop for split
seeking when fitting a new tree. To avoid the difficulty of
a dense Hessian, we propose that for each of these
subproblems, only two coordinates (i.e., two classes or a
class pair) are adaptively selected for updating, hence the
name AOSO (Adaptive One vS One) LogitBoost. Figure 1 gives
an overview of our approach. In Section 2.5 we show that
first order and second order approximations of the loss
reduction can be good measures of the quality of a selected
class pair.

Following the above formulation, ABC-LogitBoost, although
derived from a somewhat different framework in (Li, 2010a),
can thus be shown to be a special case of AOSO-LogitBoost,
but with a less flexible tree model. In Section 3.1 we
compare the differences between the two approaches in
detail and provide some intuition for AOSO's improvement
over ABC.

The rest of this paper is organised as follows: In Section 2
we first formulate the problem setting for LogitBoost and
then give the details of our approach. In Section 3 we
compare our approach to (ABC-)LogitBoost. In Section 4,
experimental results in terms of classification errors and
convergence rates are reported on a range of public
datasets.

2. The Algorithm 

We begin with the basic setting for the LogitBoost
algorithm. For $K$-class classification ($K \geq 2$),
consider an $N$ example training set $\{x_i, y_i\}_{i=1}^N$
where $x_i$ denotes a feature value and
$y_i \in \{1, \ldots, K\}$ denotes a class label. Class
probabilities conditioned on $x$, denoted by
$p = (p_1, \ldots, p_K)^T$, are learned from the training
set. For a test example with known $x$ and unknown $y$, we
predict a class label by using the Bayes rule:
$y = \arg\max_k p_k$, $k = 1, \ldots, K$.

Instead of learning the class probability directly, one
learns its "proxy" $F = (F_1, \ldots, F_K)^T$ given by the
so-called Logit link function:

$$ p_k = \frac{\exp(F_k)}{\sum_{j=1}^K \exp(F_j)} \tag{1} $$

with the constraint $\sum_{k=1}^K F_k = 0$ (Friedman et al.,
1998). For simplicity and without confusion, we hereafter
omit the dependence on $x$ for $F$ and for other related
variables.
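As a minimal sketch of the link function (1), the snippet below maps a score vector to class probabilities and checks that removing the mean of $F$ (which enforces the sum-to-zero constraint) leaves the probabilities unchanged. The function names `class_probabilities` and `center` are illustrative choices, not names from the paper.

```python
import math

def class_probabilities(F):
    """Map scores F = (F_1, ..., F_K) to probabilities via the Logit link (1)."""
    m = max(F)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(f - m) for f in F]
    total = sum(exps)
    return [e / total for e in exps]

def center(F):
    """Enforce the sum-to-zero constraint by removing the mean of F."""
    mean = sum(F) / len(F)
    return [f - mean for f in F]

F = [2.0, -1.0, 0.5]
p = class_probabilities(F)
p_centered = class_probabilities(center(F))  # same probabilities as p
```

Because (1) depends only on differences between the components of $F$, any constant shift is harmless, which is why the constraint can be imposed without loss of generality.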

The $F$ is obtained by minimizing a target function on
training data:

$$ Loss = \sum_{i=1}^N L(y_i, F_i), \tag{2} $$

where $F_i$ is shorthand for $F(x_i)$ and $L(y_i, F_i)$ is
the Logit loss for a single training example:

$$ L(y_i, F_i) = -\sum_{k=1}^K r_{ik} \log p_{ik}, \tag{3} $$

where $r_{ik} = 1$ if $y_i = k$ and 0 otherwise. The
probability $p_{ik}$ is connected to $F_{ik}$ via (1).
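The loss in (2) and (3) can be sketched as follows. Since $r_{ik}$ is an indicator, only the true-class term survives, so the per-example loss equals $\log \sum_j \exp(F_j) - F_y$. The helper names are hypothetical, and labels are taken to be 1-based as in the text.

```python
import math

def logit_loss(y, F):
    """Logit loss (3) for one example; y is a 1-based class label."""
    m = max(F)
    log_norm = m + math.log(sum(math.exp(f - m) for f in F))
    # -sum_k r_k log p_k keeps only the true class: -(F_y - log sum_j exp(F_j))
    return log_norm - F[y - 1]

def total_loss(labels, scores):
    """Target function (2): sum of per-example Logit losses."""
    return sum(logit_loss(y, F) for y, F in zip(labels, scores))

# With F = 0 every class has probability 1/K, so each example costs log K.
initial = total_loss([1, 2], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```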

To make the optimization of (2) feasible, a model is needed
to describe how $F$ depends on $x$. For example, the linear
model $F = W^T x$ is used in traditional Logit regression,
while the Generalized Additive Model is adopted in
LogitBoost:

$$ F(x) = \sum_{m=1}^M f_m(x), \tag{4} $$

where each $f_m(x)$, a $K$ dimensional sum-to-zero vector,
is learned by greedy stage-wise optimization. That is, at
each iteration $f_m(x)$ is added only based on
$F = \sum_{j=1}^{m-1} f_j$. Formally,

$$ f_m(x) = \arg\min_f \sum_{i=1}^N L(y_i, F_i + f(x_i)). \tag{5} $$
This procedure repeats M times with initial condition 
F = 0. Owing to its iterative nature, we only need 
to know how to solve (5) in order to implement the 
optimization. 

2.1. Vector Tree Model 

The $f(x)$ in (5) is typically represented by $K$ scalar
regression trees (e.g., in LogitBoost (Friedman et al.,
1998) or the Real AdaBoost.MH implementation in (Friedman
et al., 1998)) or a single vector tree (e.g., the Real
AdaBoost.MH implementation in (Kegl & Busa-Fekete, 2009)).
In this paper, we adopt a single vector tree. We further
restrict it to be a binary tree (i.e., only binary splits on
internal nodes are allowed) whose splits are perpendicular
to a coordinate axis, as in (Friedman et al., 1998) or (Li,
2010a). Formally,

$$ f(x) = \sum_{j=1}^J t_j I(x \in R_j) \tag{6} $$

where $\{R_j\}_{j=1}^J$ describes how the feature space is
partitioned, while $t_j \in \mathbb{R}^K$ with a sum-to-zero
constraint is the node value vector associated with $R_j$.
See Figure 1 for an example.
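The vector tree (6) can be pictured with a small data structure. The sketch below is only an illustration: the `Node` class, its field names, and the concrete thresholds and leaf vectors are assumptions made for the example, loosely mirroring the 3-class, 3-region tree of Figure 1 where each leaf vector has two non-zero coordinates.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    # Internal node: axis-aligned split on feature index `feature` at `threshold`.
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Terminal node: K-dimensional node vector t_j whose entries sum to zero.
    value: Optional[List[float]] = None

def evaluate(node, x):
    """Return f(x) = sum_j t_j I(x in R_j), i.e. the vector of the leaf containing x."""
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Hypothetical tree with J = 3 terminal nodes for a K = 3 class problem.
tree = Node(feature=0, threshold=0.5,
            left=Node(value=[0.3, -0.3, 0.0]),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value=[0.0, 0.7, -0.7]),
                       right=Node(value=[-0.2, 0.0, 0.2])))
```

Calling `evaluate(tree, [0.9, 1.0])` walks right at the root and left at the second split, returning the vector of the middle region.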

2.2. Tree Building 

Solving (5) with the tree model (6) is equivalent to
determining the parameters $\{t_j, R_j\}_{j=1}^J$ at the
$m$-th iteration. In this subsection we will show how this
problem reduces to solving a collection of convex
optimization subproblems for which we can use any numerical
method. Following Friedman's LogitBoost settings, here we
use Newton descent^1. Also, we will show how the gradient
and Hessian can be computed incrementally.

^1 We use Newton descent as there is evidence in (Li, 2010a)
that gradient descent, i.e., in Friedman's MART (Friedman,
2001), leads to decreased classification accuracy.

We begin with some shorthand notation for the node loss:

$$ NodeLoss(t; \mathcal{I}) = \sum_{i \in \mathcal{I}} L(y_i, F_i + t), \quad t_1 + \ldots + t_K = 0, \; t \in \mathbb{R}^K, \tag{7} $$

where $\mathcal{I}$ denotes the index set of the training
examples on some node, either internal or terminal (i.e.,
those falling into the corresponding region). Minimizing (7)
is the bridge to $\{t_j, R_j\}$ in that:

1. To obtain $\{t_j\}$ with given $\{R_j\}$, we simply take
the minimizer of (7):

$$ t_j = \arg\min_t NodeLoss(t; \mathcal{I}_j), \tag{8} $$

where $\mathcal{I}_j$ denotes the index set for $R_j$.

2. To obtain $\{R_j\}$, we recursively perform binary
splits until there are $J$ terminal nodes.

The key to the second point is to explicitly give the node
split gain. Suppose an internal node has $n$ training
examples ($n = N$ for the root node). We fix on some feature
and re-index all the $n$ examples according to their sorted
feature values. Now we need to find the index $n'$ with
$1 \leq n' < n$ that maximizes the node gain, defined as the
loss reduction after a division between the $n'$-th and
$(n' + 1)$-th examples:

$$ NodeGain(n') = NodeLoss(t^*; \mathcal{I}) - \left( NodeLoss(t_L^*; \mathcal{I}_L) + NodeLoss(t_R^*; \mathcal{I}_R) \right) \tag{9} $$

where $\mathcal{I} = \{1, \ldots, n\}$,
$\mathcal{I}_L = \{1, \ldots, n'\}$ and
$\mathcal{I}_R = \{n' + 1, \ldots, n\}$; $t^*$, $t_L^*$ and
$t_R^*$ are the minimizers of (7) with index sets
$\mathcal{I}$, $\mathcal{I}_L$ and $\mathcal{I}_R$,
respectively. Generally, this search applies to all
features. The best division, i.e., the one resulting in the
largest (9), is then recorded to perform the actual node
split.

Note that (9) arises in the context of an $O(N \times D)$
outer loop, where $D$ is the number of features. However, a
naive summing of the losses for (7) incurs an additional
$O(N)$ factor in complexity, which finally results in an
unacceptable $O(N^2 D)$ complexity for a single boosting
iteration.

A workaround is to use a Newton descent method for which
both the gradient and Hessian can be incrementally computed.
Let $g$, $H$ respectively be the $K \times 1$ gradient
vector and $K \times K$ Hessian matrix at $t = 0$. By
dropping the constant $NodeLoss(0; \mathcal{I})$ that is
irrelevant to $t$, the Taylor expansion of (7) w.r.t. $t$ up
to second order is:

$$ loss(t; \mathcal{I}) = g^T t + \frac{1}{2} t^T H t, \quad t_1 + \ldots + t_K = 0, \; t \in \mathbb{R}^K. \tag{10} $$
By noting the additive separability of (10) and using some
matrix derivatives, we have

$$ g = -\sum_{i \in \mathcal{I}} g_i, \tag{11} $$

$$ H = \sum_{i \in \mathcal{I}} H_i, \tag{12} $$

$$ g_i = r_i - p_i, \tag{13} $$

$$ H_i = P_i - p_i p_i^T, \tag{14} $$

where the diagonal matrix
$P_i = diag(p_{i1}, \ldots, p_{iK})$. We then use (10) to
compute the approximated node loss in (9). Thanks to the
additive form, both (11) and (12) can be
incrementally/decrementally computed in constant time when
the split searching proceeds from one training example to
the next. Therefore, the computation of (10) eliminates the
$O(N)$ complexity in the naive summing of losses.^2
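To make (11) to (14) concrete, the following sketch accumulates the gradient vector and Hessian matrix of a node from per-example terms. It uses plain Python lists, 1-based labels as in the text, and made-up probabilities; the function names are illustrative.

```python
def example_gradient(y, p):
    """g_i = r_i - p_i as in (13); y is a 1-based label, p the class probabilities."""
    return [(1.0 if k == y - 1 else 0.0) - pk for k, pk in enumerate(p)]

def example_hessian(p):
    """H_i = P_i - p_i p_i^T as in (14)."""
    K = len(p)
    return [[(p[a] if a == b else 0.0) - p[a] * p[b] for b in range(K)]
            for a in range(K)]

def node_statistics(labels, probs):
    """Accumulate g = -sum_i g_i (11) and H = sum_i H_i (12) over one node."""
    K = len(probs[0])
    g = [0.0] * K
    H = [[0.0] * K for _ in range(K)]
    for y, p in zip(labels, probs):
        gi, Hi = example_gradient(y, p), example_hessian(p)
        for a in range(K):
            g[a] -= gi[a]
            for b in range(K):
                H[a][b] += Hi[a][b]
    return g, H

# Two hypothetical examples of a 3-class problem.
g, H = node_statistics([1, 3], [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]])
```

Because both quantities are plain sums, moving one example from the right child to the left child during a split scan only requires adding its terms to one accumulator and subtracting them from the other.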

2.3. Properties of Approximated Node Loss 

To minimise (10), we give some properties for (10) that
should be taken into account when seeking a solver. We begin
with the sum-to-zero constraint. The probability estimate
$p_k$ in the Logit loss (3) must be non-zero and sum-to-one,
which is ensured by the link function (1). Such a link, in
turn, means that $p_k$ is unchanged by adding an arbitrary
constant to each component in $F$. As a result, the single
example loss (3) is invariant to moving $F$ along the all-1
vector $\mathbf{1}$. That is,

$$ L(y_i, F_i + c\mathbf{1}) = L(y_i, F_i), \tag{15} $$

where $c$ is an arbitrary real constant (note that
$\mathbf{1}$, coincidentally, spans the orthogonal
complement to the space defined by the sum-to-zero
constraint). This property also carries over to the
approximated node loss (10):

Property 1 $loss(t; \mathcal{I}) = loss(t + c\mathbf{1}; \mathcal{I})$.

This is obvious by noting the additive separability in (10),
as well as that $g_i^T \mathbf{1} = 0$ and
$\mathbf{1}^T H_i \mathbf{1} = 0$ hold since $p_i$ is
sum-to-one.

For the Hessian, we have $rank(H) \geq rank(H_i)$ by noting
the additive form in (12). In (Li, 2010a) it is shown that
$\det H_i = 0$ by brute-force determinant expansion. Here we
give a stronger property:



^2 In Real AdaBoost.MH, such a second order approximation is
not necessary (although possible, cf. (Zou et al., 2008)).
Due to the special form of the exponential loss and the
absence of a sum-to-zero constraint, there exists an
analytical solution for the node loss (7) by simply setting
the derivative to 0. Here also, the computation can be
incremental/decremental. Since the loss design and
AdaBoost.MH are not our main interests, we do not discuss
this further.



Property 2 $H_i$ is a positive semi-definite matrix such
that 1) $rank(H_i) = k - 1$, where $k$ is the number of
non-zero elements in $p_i$; 2) $\mathbf{1}$ is the
eigenvector for eigenvalue 0.

The proof can be found in this paper's extended ver- 
sion (Sun et al., 2012). 
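As a quick sanity check (not the proof from the extended version), two parts of Property 2 follow directly from the form $H_i = P_i - p_i p_i^T$ in (14) together with $p_i^T \mathbf{1} = 1$. The vector $v$ below is an arbitrary element of $\mathbb{R}^K$ introduced only for this check.

```latex
% 1 is an eigenvector with eigenvalue 0:
H_i \mathbf{1} = P_i \mathbf{1} - p_i \left( p_i^T \mathbf{1} \right) = p_i - p_i = 0.

% Positive semi-definiteness: the quadratic form is a variance under p_i.
v^T H_i v = \sum_{k=1}^{K} p_{ik} v_k^2 - \Big( \sum_{k=1}^{K} p_{ik} v_k \Big)^2 \;\geq\; 0
\quad \text{for all } v \in \mathbb{R}^K .
```

The second line is the variance of a random variable taking value $v_k$ with probability $p_{ik}$, which is never negative.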

The properties shown above indicate that 1) $H$ is singular,
so that unconstrained Newton descent is not applicable here,
and 2) $rank(H)$ could be as high as $K - 1$, which
prohibits the application of the standard fast quadratic
solvers designed for low rank Hessians. In the following we
propose to address this problem via block coordinate
descent, a technique that has been successfully used in
training SVMs (Bottou & Lin, 2007).

2.4. Block Coordinate Descent 

For the variable $t$ in (10), we only choose two (the least
possible number due to the sum-to-zero constraint)
coordinates, i.e., a class pair, to update while keeping the
others fixed. Suppose we have chosen the $r$-th and the
$s$-th coordinate (how to do so is deferred to the next
subsection). Let $t_r = t$ and $t_s = -t$ be the free
variables (such that $t_r + t_s = 0$) and $t_k = 0$ for
$k \neq r$, $k \neq s$. Plugging these into (10) yields an
unconstrained one dimensional quadratic problem with regards
to the scalar variable $t$:

$$ loss(t) = g t + \frac{1}{2} h t^2 \tag{16} $$

where the gradient and Hessian collapse to scalars:

$$ g = -\sum_{i \in \mathcal{I}} \left( (r_{i,r} - p_{i,r}) - (r_{i,s} - p_{i,s}) \right), \tag{17} $$

$$ h = \sum_{i \in \mathcal{I}} \left( p_{i,r}(1 - p_{i,r}) + p_{i,s}(1 - p_{i,s}) + 2 p_{i,r} p_{i,s} \right). \tag{18} $$

To this extent, we are able to obtain the analytical
expression for the minimizer and minimum of (16):

$$ t^* = \arg\min_t loss(t) = -\frac{g}{h}, \tag{19} $$

$$ loss(t^*) = -\frac{g^2}{2h}, \tag{20} $$

by noting the non-negativity of (18).
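The one dimensional problem for a class pair can be sketched numerically. The code below computes the scalar gradient and Hessian of a pair $(r, s)$ from per-example labels and probabilities (the gradient is the difference of the $r$-th and $s$-th components of (11), the Hessian is the sum in (18)) and then takes the Newton step. Pair indices are 0-based here, labels are 1-based, the data are made up, and $h > 0$ is assumed.

```python
def pair_statistics(labels, probs, r, s):
    """Scalar gradient and Hessian of loss(t) for the class pair (r, s), 0-based."""
    g = 0.0
    h = 0.0
    for y, p in zip(labels, probs):
        r_ir = 1.0 if y - 1 == r else 0.0
        r_is = 1.0 if y - 1 == s else 0.0
        g -= (r_ir - p[r]) - (r_is - p[s])
        h += p[r] * (1.0 - p[r]) + p[s] * (1.0 - p[s]) + 2.0 * p[r] * p[s]
    return g, h

def newton_step(g, h):
    """Minimizer and minimum of loss(t) = g*t + 0.5*h*t^2, assuming h > 0."""
    t_star = -g / h
    return t_star, -g * g / (2.0 * h)

# Three hypothetical examples with uniform probabilities over K = 3 classes.
uniform = [1.0 / 3.0] * 3
g, h = pair_statistics([1, 1, 2], [uniform, uniform, uniform], r=0, s=1)
t_star, min_loss = newton_step(g, h)
```

Here two of the three examples belong to class $r$, so the step is positive: the score of class $r$ goes up and that of class $s$ goes down by the same amount.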

Based on (19), node vector (8) can be approximated as

$$ t_{jk} = \begin{cases} +(-g/h) & k = r \\ -(-g/h) & k = s \\ 0 & \text{otherwise} \end{cases} \tag{21} $$

where $g$ and $h$ are respectively computed by using (17)
and (18) with index set $\mathcal{I}_j$. Based on (20), the
node gain (9) can be approximated as

$$ NodeGain(n') = \frac{g_L^2}{2 h_L} + \frac{g_R^2}{2 h_R} - \frac{g^2}{2h} \tag{22} $$

where $g$ (or $g_L$, $g_R$) and $h$ (or $h_L$, $h_R$) are
computed by using (17) and (18) with index set $\mathcal{I}$
(or $\mathcal{I}_L$, $\mathcal{I}_R$).
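A hedged sketch of the split search for one feature and one fixed class pair: examples are sorted by feature value, and the left and right scalar statistics are updated in constant time per candidate split, so the gain, taken as the parent's approximate loss minus the children's (each of the form $-g^2/(2h)$), never requires re-summing. The function name, the 0-based pair indices, and the assumption of distinct feature values are choices made for this illustration.

```python
def best_split(feature_values, labels, probs, r, s):
    """Scan one feature; return (best_gain, n') for the class pair (r, s).

    Assumes distinct feature values and strictly positive probabilities.
    """
    order = sorted(range(len(labels)), key=lambda i: feature_values[i])

    def stats(i):
        p, y = probs[i], labels[i]
        gi = -(((1.0 if y - 1 == r else 0.0) - p[r])
               - ((1.0 if y - 1 == s else 0.0) - p[s]))
        hi = p[r] * (1 - p[r]) + p[s] * (1 - p[s]) + 2 * p[r] * p[s]
        return gi, hi

    per_example = [stats(i) for i in order]
    g = sum(gi for gi, _ in per_example)
    h = sum(hi for _, hi in per_example)
    g_left = h_left = 0.0
    best_gain, best_n = float("-inf"), None
    for n_prime in range(1, len(order)):      # split between n' and n'+1
        gi, hi = per_example[n_prime - 1]
        g_left += gi                          # incremental update (left child)
        h_left += hi
        g_right, h_right = g - g_left, h - h_left   # decremental update (right child)
        gain = (g_left ** 2 / (2 * h_left) + g_right ** 2 / (2 * h_right)
                - g ** 2 / (2 * h))
        if gain > best_gain:
            best_gain, best_n = gain, n_prime
    return best_gain, best_n

uniform = [1.0 / 3.0] * 3
gain, n_prime = best_split([0.1, 0.2, 0.8, 0.9], [1, 1, 2, 2], [uniform] * 4, r=0, s=1)
```

In this toy input the first two sorted examples belong to class $r$ and the last two to class $s$, so the scan prefers the split in the middle.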

2.5. Class Pair Selection 

In (Bottou & Lin, 2007) two methods for selecting $(r, s)$
are proposed. One is based on a first order approximation.
Let $t_r$ and $t_s$ be the free variables and the rest be
fixed to 0. For a $t$ with sufficiently small fixed length,
let $t_r = \epsilon$ and $t_s = -\epsilon$ where
$\epsilon > 0$ is some small enough constant. The first
order approximation of (10) is:

$$ loss(t) \approx loss(0) + g^T t = loss(0) - \epsilon \left( -g_r - (-g_s) \right). \tag{23} $$

It follows that the indices $r$, $s$ resulting in the
largest decrement to (23) are:

$$ r = \arg\max_k \{-g_k\}, \quad s = \arg\min_k \{-g_k\}. \tag{24} $$

Another method that can be derived in a similar way takes
into account the second order information:

$$ r = \arg\max_k \{-g_k\}, \quad s = \arg\max_k \left\{ \frac{(g_r - g_k)^2}{h_{rr} + h_{kk} - 2 h_{rk}} \right\}. \tag{25} $$

Both methods are $O(K)$ procedures that are better than the
$K \times (K - 1)/2$ naive enumeration. However, in our
implementation we find that (25) achieves better results for
AOSO-LogitBoost.
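Both selection rules can be sketched in a few lines, given the gradient vector $g$ and Hessian matrix $H$ of a node. The function names and the numeric inputs are illustrative; indices are 0-based, and the denominator in the second-order rule is assumed to be positive, which holds when all class probabilities are strictly positive.

```python
def select_pair_first_order(g):
    """First-order rule: r maximizes -g_k and s minimizes -g_k."""
    K = len(g)
    r = max(range(K), key=lambda k: -g[k])
    s = min(range(K), key=lambda k: -g[k])
    return r, s

def select_pair_second_order(g, H):
    """Second-order rule: r maximizes -g_k; s maximizes
    (g_r - g_k)^2 / (h_rr + h_kk - 2 h_rk) over k != r."""
    K = len(g)
    r = max(range(K), key=lambda k: -g[k])

    def score(k):
        denom = H[r][r] + H[k][k] - 2.0 * H[r][k]
        return (g[r] - g[k]) ** 2 / denom

    s = max((k for k in range(K) if k != r), key=score)
    return r, s

# Hypothetical node statistics for K = 3 classes.
g = [-0.2, 0.6, -0.4]
H = [[0.40, -0.24, -0.16],
     [-0.24, 0.42, -0.18],
     [-0.16, -0.18, 0.34]]
pair_1 = select_pair_first_order(g)
pair_2 = select_pair_second_order(g, H)
```

The score in the second-order rule is, up to a factor of two, the loss decrease $g^2/(2h)$ that the pair $(r, k)$ would achieve in the one dimensional problem, so once $r$ is fixed the rule picks the partner class with the largest approximate loss reduction.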

Pseudocode for AOSO-LogitBoost is given in Algo- 
rithm 1. 

3. Comparison to (ABC-)LogitBoost 

In this section we compare the derivations of LogitBoost
and ABC-LogitBoost and provide some intuition for observed
behaviours in the experiments in Section 4.

3.1. ABC-LogitBoost

To solve (5) with a sum-to-zero constraint, ABC-LogitBoost
uses $K - 1$ independent trees:

$$ f_k = \begin{cases} \sum_{j=1}^J t_j I(x \in R_j), & k \neq b \\ -\sum_{l \neq b} f_l, & k = b. \end{cases} \tag{26} $$
Algorithm 1 AOSO-LogitBoost. $v$ is the shrinkage factor
that controls the learning rate.

1: $F_{i,k} = 0$, $k = 1, \ldots, K$, $i = 1, \ldots, N$
2: for $m = 1$ to $M$ do
3:   $p_{i,k} = \exp(F_{i,k}) / \sum_{j=1}^K \exp(F_{i,j})$, $k = 1, \ldots, K$, $i = 1, \ldots, N$
4:   Obtain $\{R_{mj}\}_{j=1}^J$ by recursive region partition.
     Node split gain is computed as (22), where the
     class pair $(r, s)$ is selected using (25).
5:   Compute $\{t_{mj}\}_{j=1}^J$ by (21), where the class pair
     $(r, s)$ is selected using (25).
6:   $F_i = F_i + v \sum_{j=1}^J t_{mj} I(x_i \in R_{mj})$, $i = 1, \ldots, N$.
7: end for
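The loop of Algorithm 1 can be sketched end to end in a deliberately simplified form. The version below is an illustration under several assumptions that are not from the paper: trees are restricted to $J = 2$ terminal nodes, child statistics are re-summed for brevity rather than updated incrementally, class labels are 0-based, and the toy data set is made up. The scalar statistics are those of the one dimensional pair problem: $g$ is the difference of the $r$-th and $s$-th gradient components and $h = h_{rr} + h_{ss} - 2h_{rs}$.

```python
import math

def softmax(F):
    m = max(F)
    e = [math.exp(f - m) for f in F]
    z = sum(e)
    return [v / z for v in e]

def pair_gh(idx, Y, P, r, s):
    """Scalar gradient and Hessian of class pair (r, s) on the examples idx."""
    g = h = 0.0
    for i in idx:
        p = P[i]
        g -= ((Y[i] == r) - p[r]) - ((Y[i] == s) - p[s])
        h += p[r] * (1 - p[r]) + p[s] * (1 - p[s]) + 2 * p[r] * p[s]
    return g, h

def select_pair(idx, Y, P, K):
    """Second-order pair selection; the denominator is the scalar h of pair (r, k)."""
    g = [-sum((Y[i] == k) - P[i][k] for i in idx) for k in range(K)]
    r = max(range(K), key=lambda k: -g[k])
    return r, max((k for k in range(K) if k != r),
                  key=lambda k: (g[r] - g[k]) ** 2 / pair_gh(idx, Y, P, r, k)[1])

def fit_stump(X, Y, P, K):
    """One vector tree with two leaves: (feature, threshold, t_left, t_right)."""
    root = list(range(len(X)))
    r, s = select_pair(root, Y, P, K)
    g, h = pair_gh(root, Y, P, r, s)
    best = (float("-inf"), 0, float("inf"))
    for d in range(len(X[0])):
        order = sorted(root, key=lambda i: X[i][d])
        for cut in range(1, len(order)):
            lo, hi = X[order[cut - 1]][d], X[order[cut]][d]
            if lo == hi:
                continue  # cannot split between equal feature values
            gl, hl = pair_gh(order[:cut], Y, P, r, s)
            gr, hr = pair_gh(order[cut:], Y, P, r, s)
            gain = gl * gl / (2 * hl) + gr * gr / (2 * hr) - g * g / (2 * h)
            if gain > best[0]:
                best = (gain, d, (lo + hi) / 2)
    _, d, thr = best
    leaves = []
    for side in ([i for i in root if X[i][d] <= thr],
                 [i for i in root if X[i][d] > thr]):
        t = [0.0] * K
        if side:
            r, s = select_pair(side, Y, P, K)   # each leaf picks its own pair
            g, h = pair_gh(side, Y, P, r, s)
            t[r], t[s] = -g / h, g / h          # two-coordinate node vector
        leaves.append(t)
    return d, thr, leaves[0], leaves[1]

def aoso_logitboost(X, Y, K, M=30, v=0.1):
    F = [[0.0] * K for _ in X]
    trees = []
    for _ in range(M):
        P = [softmax(f) for f in F]
        d, thr, tl, tr = fit_stump(X, Y, P, K)
        trees.append((d, thr, tl, tr))
        for i, x in enumerate(X):
            t = tl if x[d] <= thr else tr
            F[i] = [f + v * tk for f, tk in zip(F[i], t)]
    return trees, F

def loss(Y, F):
    return -sum(math.log(softmax(f)[y]) for y, f in zip(Y, F))

X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
Y = [0, 0, 1, 1, 2, 2]  # 0-based class labels
trees, F = aoso_logitboost(X, Y, K=3)
```

Every node vector has exactly two non-zero entries of opposite sign, so the scores of each example keep summing to zero throughout training without any explicit projection.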



In (Li, 2010a), the so-called base class $b$ is selected by
exhaustive search per iteration, i.e., trying all possible
$b$, which involves growing $K(K - 1)$ trees. To reduce the
time complexity, Li also proposed other methods. In (Li,
2010c), $b$ is selected only every several iterations, while
in (Li, 2008), $b$ is, intuitively, set to be the class that
leads to the largest loss reduction at the last iteration.

In ABC-LogitBoost the sum-to-zero constraint is explicitly
considered when deriving the node value and the node split
gain for the scalar regression tree. Indeed, they are the
same as (21) and (22) in this paper, although derived using
a slightly different motivation. In this sense,
ABC-LogitBoost can be seen as a special form of
AOSO-LogitBoost since: 1) for each tree, the class pair is
fixed for every node in ABC, while it is selected adaptively
in AOSO, and 2) $K - 1$ trees are added per iteration in ABC
(using the same set of probability estimates
$\{p_i\}_{i=1}^N$), while only one tree is added per
iteration by AOSO (and $\{p_i\}_{i=1}^N$ are updated as soon
as each tree is added).

Since two changes are made to ABC-LogitBoost, an immediate
question is what happens if we only make one. That is, what
happens if one vector tree is added per iteration for a
single class pair selected only for the root node and shared
by all other tree nodes, as in ABC, but the
$\{p_i\}_{i=1}^N$ are updated as soon as a tree is added, as
in AOSO? This was tried but, unfortunately, degraded
performance was observed for this combination, so the
results are not reported here.

From the above analysis, we believe the more flexible model
(as well as the model updating strategy) in AOSO is what
contributes to its improvement over ABC, as seen in
Section 4.
Table 1. Datasets used in our experiments.

datasets               K   #features   #training     #test
Poker525k             10          25      525010    500000
Poker275k             10          25      275010    500000
Poker150k             10          25      150010    500000
Poker100k             10          25      100010    500000
Poker25kT1            10          25       25010    500000
Poker25kT2            10          25       25010    500000
Covertype290k          7          54      290506    290506
Covertype145k          7          54      145253    290506
Letter                26          16       16000      4000
Letter15k             26          16       15000      5000
Letter2k              26          16        2000     18000
Letter4k              26          16        4000     16000
Pendigits             10          16        7494      3498
Zipcode               10         256        7291      2007
(a.k.a. USPS)
Isolet                26         617        6238      1559
Optdigits             10          64        3823      1797

Mnist10k              10         784       10000     60000
M-Basic               10         784       12000     50000
M-Image               10         784       12000     50000
M-Rand                10         784       12000     50000
M-Noise1              10         784       10000      2000
M-Noise2              10         784       10000      2000
M-Noise3              10         784       10000      2000
M-Noise4              10         784       10000      2000
M-Noise5              10         784       10000      2000
M-Noise6              10         784       10000      2000


3.2. LogitBoost 

In the original LogitBoost (Friedman et al., 1998), the
Hessian matrix (14) is approximated diagonally. In this way,
the $f$ in (5) is expressed by $K$ uncoupled scalar trees:

$$ f_k = \sum_{j=1}^J t_j I(x \in R_j), \quad k = 1, 2, \ldots, K \tag{27} $$

with the gradient and Hessian for computing node value and
node split gain given by:

$$ g_k = -\sum_{i \in \mathcal{I}} (r_{i,k} - p_{i,k}), \quad h_k = \sum_{i \in \mathcal{I}} p_{i,k}(1 - p_{i,k}). \tag{28} $$

Here we use the subscript $k$ for $g$ and $h$ to emphasize
that the $k$-th tree is built independently of the other
$K - 1$ trees (i.e., the sum-to-zero constraint is dropped).
Although this simplifies the mathematics, such an aggressive
approximation turns out to harm both classification accuracy
and convergence rate, as shown in Li's experiments (Li,
2009).
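To illustrate what dropping the coupling means, the sketch below computes a per-class gradient and diagonal Hessian entry for every class and takes one independent Newton step $-g_k/h_k$ per class, as described in this section. The data are made up, labels are 1-based, and the function names are illustrative. The resulting node values are not guaranteed to sum to zero.

```python
def diagonal_statistics(labels, probs, k):
    """Per-class gradient g_k and diagonal Hessian entry h_k for class k (0-based)."""
    g = -sum((1.0 if y - 1 == k else 0.0) - p[k] for y, p in zip(labels, probs))
    h = sum(p[k] * (1.0 - p[k]) for p in probs)
    return g, h

def independent_node_values(labels, probs):
    """One Newton step -g_k / h_k per class, each computed in isolation."""
    K = len(probs[0])
    values = []
    for k in range(K):
        g, h = diagonal_statistics(labels, probs, k)
        values.append(-g / h)
    return values

# Three hypothetical examples of a 3-class problem.
labels = [1, 1, 2]
probs = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2], [0.1, 0.7, 0.2]]
t = independent_node_values(labels, probs)
```

For these inputs the three values add up to roughly -0.22 rather than 0, which is the loss of coupling that the two-coordinate update avoids by construction.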

4. Experiments 

In this section we compare AOSO-LogitBoost with
ABC-LogitBoost, which was shown to outperform original
LogitBoost in Li's experiments (Li, 2010a; 2009). We test
AOSO on all the datasets used in (Li, 2010a; 2009), as
listed in Table 1. In the top section are UCI datasets and
in the bottom are Mnist datasets with many variations (see
(Li, 2010b) for detailed descriptions).^3 To exhaust the
learning ability of (ABC-)LogitBoost, Li let the boosting
stop when either the training converges (i.e., the loss (2)
approaches 0, implemented as $< 10^{-16}$) or a maximum
number of iterations, $M$, is reached. Test errors at the
last iteration are simply reported since no obvious
over-fitting is observed. By default, $M = 10000$, while for
the large datasets (Covertype290k, Poker525k, Poker275k,
Poker150k, Poker100k) $M = 5000$ (Li, 2010a; 2009). We adopt
the same criteria, except that our maximum number of
iterations is $M_{AOSO} = (K - 1) \times M_{ABC}$, where $K$
is the number of classes. Note that only one tree is added
at each iteration in AOSO, while $K - 1$ are added in ABC.
Thus, this correction compares the same maximum number of
trees for both AOSO and ABC.

The most important tuning parameters in LogitBoost are the
number of terminal nodes $J$ and the shrinkage factor $v$.
In (Li, 2010a; 2009), Li reported results of
(ABC-)LogitBoost for a number of $J$-$v$ combinations. We
report the corresponding results for AOSO-LogitBoost for the
same combinations. In the following, we intend to show that
for nearly all $J$-$v$ combinations, AOSO-LogitBoost has
lower classification error and faster convergence rates than
ABC-LogitBoost.

Table 2. Test classification errors on Mnist10k. In each
J-v entry, the first entry is for ABC-LogitBoost and the
second for AOSO-LogitBoost. Lower one is in bold.

  J     v = 0.04     v = 0.06     v = 0.08      v = 0.1
  4    2630 2515    2600 2414    2535 2414    2522 2392
  6    2263 2133    2252 2146    2226 2146    2223 2134
  8    2159 2055    2138 2046    2120 2046    2143 2055
 10    2122 2010    2118 1980    2091 1980    2097 2014
 12    2084 1968    2090 1965    2090 1965    2095 1995
 14    2083 1945    2094 1938    2063 1938    2050 1935
 16    2111 1941    2114 1928    2097 1928    2082 1966
 18    2088 1925    2087 1916    2088 1916    2097 1920
 20    2128 1930    2112 1917    2095 1917    2102 1948
 24    2174 1901    2147 1920    2129 1920    2138 1903
 30    2235 1887    2237 1885    2221 1885    2177 1917
 40    2310 1923    2284 1890    2257 1890    2260 1912
 50    2353 1958    2359 1910    2332 1910    2341 1934

4.1. Classification Errors 

Table 2 shows results of various J-v combinations for a
representative dataset. Results on more datasets can be
found in this paper's extended version (Sun et al., 2012).

In Table 3 we summarize the results for all datasets. In
(Li, 2010a), Li reported that ABC-LogitBoost is insensitive
to J, v on all datasets except for Poker25kT1 and
Poker25kT2. Therefore, Li summarized classification errors
for ABC simply with J = 20 and v = 0.1, except that on
Poker25kT1 and Poker25kT2 errors are reported by using the
other's test set as a validation set. Based on the same
criteria we summarize AOSO in the middle panel of Table 3
where the test errors as well as the improvement relative to
ABC are given. In the right panel of Table 3 we provide the
comparison for the best results achieved over all J-v
combinations when the corresponding results for ABC are
available in (Li, 2010a) or (Li, 2009).

^3 Code and data are available at
http://ivg.au.tsinghua.edu.cn/index.php?n=People.PengSun

Table 3. Summary of test classification errors. Lower one is
in bold. Middle panel: J = 20, v = 0.1 except for Poker25kT1
and Poker25kT2 on which J, v are chosen by validation (see
the text in 4.1); Right panel: the overall best. Dash "-"
means unavailable in (Li, 2010a) (Li, 2009). Relative
improvements (R) and P-values (pv) are given.

                          |        J = 20, v = 0.1         |          overall best
Datasets         #tests   |   ABC   AOSO        R       pv |   ABC   AOSO        R       pv
Poker525k        500000   |  1736   1537   0.1146   0.0002 |     -      -        -        -
Poker275k        500000   |  2727   2624   0.0378   0.0790 |     -      -        -        -
Poker150k        500000   |  5104   3951   0.2259   0.0000 |     -      -        -        -
Poker100k        500000   | 13707   7558   0.4486   0.0000 |     -      -        -        -
Poker25kT1       500000   | 37345  31399   0.1592   0.0000 | 37345  31399   0.1592   0.0000
Poker25kT2       500000   | 36731  31645   0.1385   0.0000 | 36731  31645   0.1385   0.0000
Covertype290k    290506   |  9727   9586   0.0145   0.1511 |     -      -        -        -
Covertype145k    290506   | 13986  13712   0.0196   0.0458 |     -      -        -        -
Letter             4000   |    89     92  -0.0337   0.5892 |    89     88   0.0112   0.4697
Letter15k          5000   |   109    116  -0.0642   0.6815 |     -      -        -        -
Letter4k          16000   |  1055    991   0.0607   0.0718 |  1034    961   0.0706   0.0457
Letter2k          18000   |  2034   1862   0.0846   0.0018 |  1991   1851   0.0703   0.0084
Pendigits          3498   |   100     83   0.1700   0.1014 |    90     81   0.1000   0.2430
Zipcode            2007   |    96     99  -0.0313   0.5872 |    92     94  -0.0217   0.5597
Isolet             1559   |    65     55   0.1538   0.1759 |    55     50   0.0909   0.3039
Optdigits          1797   |    55     38   0.3091   0.0370 |    38     34   0.1053   0.3170
Mnist10k          60000   |  2102   1948   0.0733   0.0069 |  2050   1885   0.0805   0.0037
M-Basic           50000   |  1602   1434   0.1049   0.0010 |     -      -        -        -
M-Rotate          50000   |  5959   5729   0.0386   0.0118 |     -      -        -        -
M-Image           50000   |  4268   4167   0.0237   0.1252 |  4214   4002   0.0503   0.0073
M-Rand            50000   |  4725   4588   0.0290   0.0680 |     -      -        -        -
M-Noise1           2000   |   234    228   0.0256   0.3833 |     -      -        -        -
M-Noise2           2000   |   237    233   0.0169   0.4221 |     -      -        -        -
M-Noise3           2000   |   238    233   0.0210   0.4031 |     -      -        -        -
M-Noise4           2000   |   238    233   0.0210   0.4031 |     -      -        -        -
M-Noise5           2000   |   227    214   0.0573   0.2558 |     -      -        -        -
M-Noise6           2000   |   201    191   0.0498   0.2974 |     -      -        -        -

Table 4. Number of trees added at convergence on selected
datasets. R stands for the ratio AOSO/ABC.

         Mnist10k     M-Rand    M-Image
ABC          7092      15255      14958
R          0.7689     0.7763     0.8101

        Letter15k   Letter4k   Letter2k
ABC         45000      20900      13275
R          0.5512     0.5587     0.5424

We also tested the statistical significance of the
difference between AOSO and ABC. We assume the
classification error rate is subject to some Binomial
distribution. Let $z$ denote the number of errors and $n$
the number of tests; then the estimate of the error rate is
$p = z/n$ and its variance is $p(1 - p)/n$. Subsequently, we
approximate the Binomial distribution by a Gaussian
distribution and perform a hypothesis test. The p-values are
reported in Table 3.
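The significance computation can be sketched as below. The exact form of the hypothesis test is not spelled out in the text, so this is an assumption: a one-sided test on the difference of the two error rates with unpooled variances, which appears consistent with the Mnist10k and Letter rows of Table 3. The function names are illustrative.

```python
import math

def relative_improvement(errors_abc, errors_aoso):
    """R in Table 3: reduction in the error count relative to ABC."""
    return (errors_abc - errors_aoso) / errors_abc

def one_sided_p_value(errors_abc, errors_aoso, n_tests):
    """Gaussian approximation to the Binomial; H0: AOSO is not better than ABC."""
    p_abc = errors_abc / n_tests
    p_aoso = errors_aoso / n_tests
    variance = p_abc * (1 - p_abc) / n_tests + p_aoso * (1 - p_aoso) / n_tests
    z = (p_abc - p_aoso) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal

# Mnist10k, middle panel of Table 3: 2102 (ABC) vs. 1948 (AOSO) errors on 60000 tests.
r_mnist = relative_improvement(2102, 1948)
pv_mnist = one_sided_p_value(2102, 1948, 60000)
```

Under this assumed test, a negative improvement (as on Letter with J = 20, v = 0.1) yields a p-value above 0.5.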

For some problems, we note that LogitBoost (both ABC and
AOSO) outperforms other state-of-the-art classifiers such as
SVM or Deep Learning (e.g., the test error rate on Poker is
40% for SVM and < 10% for both ABC and AOSO, with AOSO even
lower than ABC; on M-Image it is 16.15% for DBN-1, 8.54% for
ABC and 8.33% for AOSO). See this paper's extended version
(Sun et al., 2012) for details. This shows that AOSO's
improvement over ABC is worth the effort.

Table 5. Number of trees added at convergence on Mnist10k
for a number of J-v combinations. For each J-v entry, the
first number is for ABC, the second for the ratio AOSO/ABC.

          v = 0.04        v = 0.06         v = 0.1
J = 4    90000 1.0       90000 1.0       90000 1.0
J = 6    90000 0.7740    63531 0.7249    38223 0.7175
J = 8    55989 0.7962    38223 0.7788    22482 0.7915
J = 10   39780 0.8103    27135 0.7973    16227 0.8000
J = 12   31653 0.8109    20997 0.8074    12501 0.8269
J = 14   26694 0.7854    17397 0.8047    10449 0.8160
J = 16   22671 0.7832    11704 1.0290     8910 0.8063
J = 18   19602 0.7805    13104 0.7888     7803 0.7933
J = 20   17910 0.7706    11970 0.7683     7092 0.7689
J = 24   14895 0.7514     9999 0.7567     6012 0.7596
J = 30   12168 0.7333     8028 0.7272     4761 0.7524
J = 40    9846 0.6750     6498 0.6853     3870 0.6917
J = 50    8505 0.6420     5571 0.6448     3348 0.6589

4.2. Convergence Rate 

Recall that we stop the boosting procedure if either the
maximum number of iterations is reached or it converges
(i.e., the loss (2) $< 10^{-16}$). The fewer trees added
when boosting stops, the faster the convergence and the
lower the time cost for either training or testing. We
compare AOSO with ABC in terms of the number of trees added
when boosting stops for the results of ABC available in (Li,
2010a; 2009). Note that simply comparing the number of
boosting iterations is unfair to AOSO, since at each
iteration only one tree is added in AOSO and K - 1 in ABC.

Results are shown in Table 4 and Table 5. Except when J and
v are too small, or for particularly difficult datasets
where both ABC and AOSO reach the maximum number of
iterations, we found that the trees needed in AOSO are
typically only 50% to 80% of those in ABC.

Figure 2 shows plots of test classification error vs.
iterations for both ABC and AOSO, and shows that AOSO's test
error decreases faster. More plots for AOSO can be found in
this paper's extended version (Sun et al., 2012).

Figure 2. Errors vs. iterations on selected datasets and
parameters. Top row: ABC (copied from (Li, 2010a)); Bottom
row: AOSO (horizontal axis scaled to compensate the K - 1
factor).

5. Conclusions 

We present an improved LogitBoost, namely AOSO- 
LogitBoost, for multi-class classification. Compared 
with ABC-LogitBoost, our experiments suggest that 
our adaptive class pair selection technique results in 
lower classification error and faster convergence rates. 